Monitoring app performance. OpenTelemetry & Jaeger

Evgeniy Moiseev
Published in Evocargo
8 min read · Jul 19, 2022

I currently work on the fleet management system at Evocargo. This system conducts real-time tracking of autonomous vehicles across the map and monitors our service performance.

Evocargo is the developer and provider of an autonomous cargo transportation service. It produces self-driving vehicles and transforms conventional logistics processes through innovative robotics technologies.

The fleet management system

Safety and efficiency are vital in autonomous transportation, so my team and I must be fully aware of everything going on in our complex multi-service infrastructure. That way, we can rapidly address any downtime issue. For example, if a warehouse worker requests a vehicle that doesn’t arrive, we must have the instrumentation to detect which of the services has failed us. For this, I’ve implemented an observability solution based on OpenTelemetry and Jaeger.

In this article, I’ll share my experience of integrating these tools. First, I’ll describe my requirements for the observability solution. Then, I’ll move on to lay out instructions for the chosen tools so you can follow along and set them up too. Finally, I’ll show you how the data from the services is presented in Jaeger and give you some examples of how to set up the instrumentation for FastAPI and optimize your span storage.

Let’s dive in!

My requirements for the observability solution

To identify a slow component, you can start your app with a profiler, collect statistics, and build a flame graph. To do this, you need to change the code or start the app in a special mode. This approach works for small projects with a few apps, but it falls short once you need to scale or analyze data from the production environment.
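
As a minimal sketch of that single-process approach, Python's built-in cProfile module can collect the statistics. The functions slow_step and handle_request are hypothetical stand-ins for real application code, and profile_report is a helper made up for this illustration.

```python
import cProfile
import io
import pstats
import time


def slow_step() -> None:
    # Hypothetical stand-in for a slow component (e.g. a database query).
    time.sleep(0.05)


def handle_request() -> None:
    # Hypothetical request handler that calls the slow component twice.
    slow_step()
    slow_step()


def profile_report(func) -> str:
    """Run func under cProfile and return a text report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return buffer.getvalue()


report = profile_report(handle_request)
print(report)
```

The report covers only the one process that was profiled, which is exactly the limitation that pushes a multi-service system toward distributed tracing.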

If your service is complex, with plenty of apps and components like ours at Evocargo, it becomes more challenging to identify the bottleneck. My key requirements for the tracing and monitoring solution were the following:

  • A friendly UI with graphs
  • Batch analysis of data collected in production
  • Visual representation of relations between services, distributed tracing
  • Display of the sequence of requests and their results
  • The ability to control storage usage

The combination of OpenTelemetry & Jaeger met all of them. Here is an extremely simplified path of a request from a fleet operator to a vehicle and a screenshot of how the trace of the request is displayed in the Jaeger UI.

Basic illustration of how spans are sent to Jaeger
Traces of the request displayed in Jaeger and details on the selected trace

Pretty cool! Now let’s take a closer look at these tools and how you can set them up (spoiler: it’s super easy).

Getting started with OpenTelemetry

OpenTelemetry is a set of APIs, SDKs, and other tools that allow us to collect and manage telemetry data (metrics, logs, and traces) and analyze the software’s performance and behavior.

The instrumentation of a Python application is quite simple. I’ve written this code following the instructions in the OpenTelemetry docs.

# tracing_example.py
import asyncio
import datetime

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
processor = BatchSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)


async def some_work_with_db():
    # assuming there will be some sort of query to database
    await asyncio.sleep(1)


async def update_vehicle_registry_cache():
    # assuming there will be some sort of query to cache
    await asyncio.sleep(1)


async def send_update_notification():
    # assuming there will be some sort of API call
    await asyncio.sleep(1)


async def update_vehicle_registry():
    with tracer.start_as_current_span("update_vehicle_registry"):
        await some_work_with_db()
        with tracer.start_as_current_span("update_vehicle_registry_cache"):
            await update_vehicle_registry_cache()
            with tracer.start_as_current_span("send_update_notification") as notification_span:
                notification_span.set_attribute('sent_at', datetime.datetime.now().isoformat())
                await send_update_notification()


asyncio.run(update_vehicle_registry())

When I run this code, it prints the spans to stdout.

{
    "name": "send_update_notification",
    "context": {
        "trace_id": "0xd1ac6cbc6c2be9b68466bb96261b5726",
        "span_id": "0x9cbbb9f42d353f9c",
        "trace_state": "[]"
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": "0x45e762c0013b1ed6",
    "start_time": "2022-05-26T13:30:02.218383Z",
    "end_time": "2022-05-26T13:30:03.220212Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "sent_at": "2022-05-26T15:30:02.218454"
    },
    "events": [],
    "links": [],
    "resource": {
        "telemetry.sdk.language": "python",
        "telemetry.sdk.name": "opentelemetry",
        "telemetry.sdk.version": "1.10.0",
        "service.name": "unknown_service"
    }
}
{
    "name": "update_vehicle_registry_cache",
    "context": {
        "trace_id": "0xd1ac6cbc6c2be9b68466bb96261b5726",
        "span_id": "0x45e762c0013b1ed6",
        "trace_state": "[]"
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": "0x89e2e2b46770e915",
    "start_time": "2022-05-26T13:30:01.217480Z",
    "end_time": "2022-05-26T13:30:03.220381Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {},
    "events": [],
    "links": [],
    "resource": {
        "telemetry.sdk.language": "python",
        "telemetry.sdk.name": "opentelemetry",
        "telemetry.sdk.version": "1.10.0",
        "service.name": "unknown_service"
    }
}
{
    "name": "update_vehicle_registry",
    "context": {
        "trace_id": "0xd1ac6cbc6c2be9b68466bb96261b5726",
        "span_id": "0x89e2e2b46770e915",
        "trace_state": "[]"
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": null,
    "start_time": "2022-05-26T13:30:00.216190Z",
    "end_time": "2022-05-26T13:30:03.220439Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {},
    "events": [],
    "links": [],
    "resource": {
        "telemetry.sdk.language": "python",
        "telemetry.sdk.name": "opentelemetry",
        "telemetry.sdk.version": "1.10.0",
        "service.name": "unknown_service"
    }
}
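
The parent_id fields are what tie these three spans into one trace. As a rough illustration, the sketch below rebuilds the hierarchy and the durations from the names, IDs, and timestamps printed above, using only the standard library. The helpers duration_seconds and render_tree are made up for this example and are not part of OpenTelemetry.

```python
from datetime import datetime

# Names, IDs, and timestamps copied from the console output above.
spans = [
    {"name": "send_update_notification",
     "span_id": "0x9cbbb9f42d353f9c", "parent_id": "0x45e762c0013b1ed6",
     "start_time": "2022-05-26T13:30:02.218383Z",
     "end_time": "2022-05-26T13:30:03.220212Z"},
    {"name": "update_vehicle_registry_cache",
     "span_id": "0x45e762c0013b1ed6", "parent_id": "0x89e2e2b46770e915",
     "start_time": "2022-05-26T13:30:01.217480Z",
     "end_time": "2022-05-26T13:30:03.220381Z"},
    {"name": "update_vehicle_registry",
     "span_id": "0x89e2e2b46770e915", "parent_id": None,
     "start_time": "2022-05-26T13:30:00.216190Z",
     "end_time": "2022-05-26T13:30:03.220439Z"},
]

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def duration_seconds(span: dict) -> float:
    """Return the span duration in seconds."""
    start = datetime.strptime(span["start_time"], TIME_FORMAT)
    end = datetime.strptime(span["end_time"], TIME_FORMAT)
    return (end - start).total_seconds()


def render_tree(spans: list) -> list:
    """Return one indented line per span, parents before their children."""
    children = {}
    for span in spans:
        children.setdefault(span["parent_id"], []).append(span)

    lines = []

    def walk(parent_id, depth):
        siblings = sorted(children.get(parent_id, []), key=lambda s: s["start_time"])
        for span in siblings:
            lines.append(f"{'  ' * depth}{span['name']} ({duration_seconds(span):.3f}s)")
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # the root span has no parent
    return lines


tree_lines = render_tree(spans)
print("\n".join(tree_lines))
```

Each span includes the time of the spans nested inside it, which is why the root takes about three seconds while the innermost span takes about one.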

Stdout is good, but to produce an enjoyable monitoring experience, I need something more. I want to visualize the data, and that’s where Jaeger comes in.

Setting up Jaeger

I’ll offer a step-by-step description of how to set up Jaeger, so you can just follow along.

1. Let’s run Jaeger locally via Docker as described in the docs.

docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14250:14250 \
  -p 14268:14268 \
  -p 14269:14269 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.34

2. After the container is up, I can access its web interface at http://localhost:16686
An empty Jaeger screen opens — no data from our apps is visible here yet.

An empty Jaeger screen

3. Let’s modify our code slightly so that the required data is sent to Jaeger.

import asyncio
import datetime

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)
provider = TracerProvider(resource=Resource.create({SERVICE_NAME: "vehicle_registry"}))
processor = BatchSpanProcessor(jaeger_exporter)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)


async def some_work_with_db():
    # assuming there will be some sort of query to database
    await asyncio.sleep(1)


async def update_vehicle_registry_cache():
    # assuming there will be some sort of query to cache
    await asyncio.sleep(1)


async def send_update_notification():
    # assuming there will be some sort of API call
    await asyncio.sleep(1)


async def update_vehicle_registry():
    with tracer.start_as_current_span("update_vehicle_registry"):
        await some_work_with_db()
        with tracer.start_as_current_span("update_vehicle_registry_cache"):
            await update_vehicle_registry_cache()
            with tracer.start_as_current_span("send_update_notification") as notification_span:
                notification_span.set_attribute('sent_at', datetime.datetime.now().isoformat())
                await send_update_notification()


asyncio.run(update_vehicle_registry())

4. Now I can see our services and spans in the Jaeger UI.

Data from our services displayed in Jaeger

5. To see the details, I click on the trace entry.

Trace details

Note that you can add custom attributes to a span. For example, I’ve added the ‘sent_at’ attribute by writing just one line notification_span.set_attribute('sent_at', datetime.datetime.now().isoformat()) and here it is in Jaeger.
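
One detail worth knowing: OpenTelemetry attribute values are limited to strings, booleans, numbers, and sequences of those, which is why the example converts the timestamp with isoformat() instead of passing the datetime object itself. Below is a minimal sketch of a conversion helper. The name to_attribute_value is hypothetical, not part of the OpenTelemetry API, and for brevity it handles scalar values only.

```python
import datetime
import uuid
from typing import Any, Union

# Scalar types that can be stored as a span attribute without conversion.
AttributeValue = Union[str, bool, int, float]


def to_attribute_value(value: Any) -> AttributeValue:
    """Convert an arbitrary scalar value into something a span attribute accepts."""
    if isinstance(value, (str, bool, int, float)):
        return value
    if isinstance(value, (datetime.datetime, datetime.date)):
        # Same conversion as in the example above.
        return value.isoformat()
    # Fallback for UUIDs, enums, paths and similar objects.
    return str(value)


sent_at = to_attribute_value(datetime.datetime(2022, 5, 26, 15, 30, 2))
request_id = to_attribute_value(uuid.UUID("12345678-1234-5678-1234-567812345678"))
print(sent_at, request_id)
```

With such a helper, the call from the example could be written as notification_span.set_attribute('sent_at', to_attribute_value(datetime.datetime.now())).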

Instrumenting FastAPI

Now you know how to send spans to Jaeger with a little script. When you deal with a larger application, or even a set of applications, things might get more complicated. In this case, don’t hesitate to check out what is available in the open-source community: there are plenty of integrations for popular frameworks that you can use off the shelf. For example, the integration for FastAPI, a modern Python web framework, looks quite straightforward:

# tracing.py
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def setup_tracing(*, app: FastAPI, service_name: str, jaeger_agent_host_name: str, jaeger_agent_port: int) -> None:
    # Register a tracer provider that tags every span with the service name
    tracer_provider = TracerProvider(resource=Resource.create({SERVICE_NAME: service_name}))
    jaeger_exporter = JaegerExporter(agent_host_name=jaeger_agent_host_name, agent_port=jaeger_agent_port)
    tracer_provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))
    trace.set_tracer_provider(tracer_provider)
    # Create spans automatically for every incoming request
    FastAPIInstrumentor.instrument_app(app=app)

And that’s it. It just works. Everything you need is under the hood in FastAPIInstrumentor.instrument_app(app=app). Just run this function when you initialize your application.

Here is a sample app to show you how a trace can pass through several services:

# main.py
import asyncio
import os
import uuid

import httpx
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from pydantic import BaseModel

from tracing import setup_tracing

app = FastAPI()

setup_tracing(
    app=app,
    service_name=os.getenv('APP_SERVICE_NAME'),
    jaeger_agent_host_name='localhost',
    jaeger_agent_port=6831
)
tracer = trace.get_tracer(__name__)


async def some_work_with_db():
    # assuming there will be some sort of query to database
    await asyncio.sleep(1)


class Notification(BaseModel):
    notification_id: str


@app.post("/update_vehicle_registry")
async def update_vehicle_registry():
    with tracer.start_as_current_span("send_update_notification") as notification_span:
        await some_work_with_db()
        notification_id = str(uuid.uuid4())
        notification_span.set_attribute('notification_id', notification_id)
        # async with closes the client once the request is done
        async with httpx.AsyncClient(base_url='http://127.0.0.1:8001') as client:
            HTTPXClientInstrumentor.instrument_client(client)
            await client.post('/receive_update_notification', json={'notification_id': notification_id})


@app.post("/receive_update_notification")
async def receive_update_notification(notification: Notification):
    with tracer.start_as_current_span("receive_update_notification") as notification_span:
        notification_span.set_attribute('notification_id', notification.notification_id)
        await asyncio.sleep(1)

  1. Let’s start the first instance of this app on port 8000 and the second instance on port 8001 with different service names.
    APP_SERVICE_NAME=vehicle_registry uvicorn main:app --port 8000
    APP_SERVICE_NAME=notification_receiver uvicorn main:app --port 8001
  2. Then let’s invoke the first app (vehicle_registry) on /update_vehicle_registry
    curl -X 'POST' 'http://127.0.0.1:8000/update_vehicle_registry' -H 'accept: application/json' -d ''
  3. As instrumentation for the app (via FastAPIInstrumentor) and httpx client (via HTTPXClientInstrumentor) has been done already, we can see the entire course of our initial request:

Traces of a request sent from FastAPI
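
What links the two services is the trace context that travels with the HTTP request. Assuming the default W3C Trace Context propagation is in use, the instrumented client adds a traceparent header and the receiving service continues the same trace from it. The sketch below only illustrates that header format with the standard library. The helpers build_traceparent and parse_traceparent are made up for this example, and the IDs are reused from the console output earlier in the article purely for illustration.

```python
import re
from typing import NamedTuple, Optional

# version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a version-00 traceparent header value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a four-field traceparent value, returning None if it is invalid."""
    match = TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid according to the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceParent(trace_id, span_id, bool(int(flags, 16) & 0x01))


header = build_traceparent("d1ac6cbc6c2be9b68466bb96261b5726", "9cbbb9f42d353f9c")
parsed = parse_traceparent(header)
print(header)
print(parsed)
```

The receiving side creates its spans with the same trace_id and uses parent_span_id as the parent, which is how Jaeger can draw both services in a single trace.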

Storage optimization

Spans are created and sent all the time, but you don’t have to save them all. To decrease the load on the storage, you can configure client sampling. If we open the file /etc/jaeger/sampling_strategies.json inside the Jaeger container, we will see:

{
"default_strategy": {
"type": "probabilistic",
"param": 1
}
}

That means that we are saving all traces. If we change 1 to 0.5, each trace will be saved with a probability of 0.5, so roughly half of the traces will be stored. There are more options for tuning sampling, and you can find them in the Jaeger docs. For example, we can turn off traces for our application health checks or even make sampling adaptive.
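
To get a feel for what a probabilistic strategy does, here is a simplified sketch of the idea. It is not Jaeger's actual implementation: the function is_sampled and its threshold rule are assumptions made for illustration, and the trace IDs are randomly generated.

```python
import random


def is_sampled(trace_id: int, rate: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below rate * 2**64."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    low_bits = trace_id & ((1 << 64) - 1)
    return low_bits < rate * (1 << 64)


rng = random.Random(42)  # fixed seed so that the run is reproducible
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]

kept_all = sum(is_sampled(t, 1.0) for t in trace_ids)
kept_half = sum(is_sampled(t, 0.5) for t in trace_ids)
kept_none = sum(is_sampled(t, 0.0) for t in trace_ids)

print(f"param=1:   kept {kept_all / len(trace_ids):.1%}")
print(f"param=0.5: kept {kept_half / len(trace_ids):.1%}")
print(f"param=0:   kept {kept_none / len(trace_ids):.1%}")
```

Because the decision depends only on the trace ID, it is the same for every span of a trace, so a trace is either kept whole or dropped whole.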

Service relations

The last thing I would like to mention about Jaeger is the implicit building of directed acyclic graphs between services in your system. Such graphs give you a high-level representation of your whole system:

Service relations displayed in Jaeger
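
Conceptually, such a graph can be derived from the spans themselves: whenever a span's parent belongs to a different service, that is a call from one service to another. The sketch below illustrates this idea with hypothetical, simplified spans modelled on the FastAPI example. The span IDs and the service_edges helper are made up, and this is not how Jaeger itself computes the graph.

```python
from collections import Counter

# Hypothetical, simplified spans modelled on the FastAPI example.
spans = [
    {"span_id": "a", "parent_id": None, "service": "vehicle_registry"},
    {"span_id": "b", "parent_id": "a", "service": "vehicle_registry"},
    {"span_id": "c", "parent_id": "b", "service": "vehicle_registry"},  # outgoing HTTP call
    {"span_id": "d", "parent_id": "c", "service": "notification_receiver"},
    {"span_id": "e", "parent_id": "d", "service": "notification_receiver"},
]


def service_edges(spans: list) -> Counter:
    """Count calls between services as (caller, callee) pairs."""
    service_by_span = {span["span_id"]: span["service"] for span in spans}
    edges = Counter()
    for span in spans:
        parent_service = service_by_span.get(span["parent_id"])
        # An edge exists only where the parent span lives in another service.
        if parent_service is not None and parent_service != span["service"]:
            edges[(parent_service, span["service"])] += 1
    return edges


edges = service_edges(spans)
for (caller, callee), calls in edges.items():
    print(f"{caller} -> {callee}: {calls} call(s)")
```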

Now you know how to set up OpenTelemetry and Jaeger, add custom attributes to a span, and optimize your storage through sampling. You have also seen the instrumentation process for a simple Python script and a FastAPI application. I hope the instructions were easy to follow.

However, instrumenting MQTT and Kafka asynchronous clients is trickier, as there is no ready-made open-source solution. So my team has developed our own integration that we’ll describe in future articles. Be sure not to miss them!

Evgeniy Moiseev
Evocargo

Backend lead at Evocargo, fleet management systems dept